Lecture 10 - Regression and Linear Models
Lecture 9: Review
Covered
- Correlation analysis: measuring relationships between variables
- The distinction between correlation and regression
- Simple linear regression: predicting one variable from another
- Estimating and interpreting regression parameters
- Testing assumptions and handling violations
- Analysis of variance in regression
- Model selection and comparison
Lecture 10: Overview
Linear regression:
- Analysis of variance
- Explained variance
- Assumptions and diagnostics
- Dealing with violations
- Model II regression
- Robust regression
- Smoothing regressions
Lecture 10: Linear Regression
Simple Linear Regression Model
Simple linear regression models the relationship between a response variable (Y) and a predictor variable (X).
The sample regression equation is:
\[\hat{Y} = a + bX\]
Where:
- \(\hat{Y}\) is the predicted value of Y
- a is the estimate of α (intercept) sometimes \(\beta_0\)
- b is the estimate of β (slope) sometimes \(\beta_1\)
Method of Least Squares: The line is chosen to minimize the sum of squared vertical distances (residuals) between observed and predicted Y values.
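As a sketch, the least-squares estimates can be computed directly from these definitions (illustrative Python; the lecture's own analyses use R's lm(), and the toy data here are an assumption):

```python
import numpy as np

def least_squares(x, y):
    """Least-squares estimates for Y-hat = a + b*X."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # slope: sum of cross-deviations over sum of squared X-deviations
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # intercept: line passes through (Xbar, Ybar)
    a = y.mean() - b * x.mean()
    return a, b

# Toy data lying exactly on y = 2 + 3x, so the fit recovers a = 2, b = 3
x = [0, 1, 2, 3, 4]
y = [2, 5, 8, 11, 14]
a, b = least_squares(x, y)
```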
Lecture 10: Linear Regression
Simple Linear Regression Model
- Male lions develop more black pigmentation on their noses as they age.
- This relationship can be used to estimate the age of lions in the field.
Call:
lm(formula = age_years ~ proportion_black, data = lion_data)
Residuals:
Min 1Q Median 3Q Max
-2.5449 -1.1117 -0.5285 0.9635 4.3421
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8790 0.5688 1.545 0.133
proportion_black 10.6471 1.5095 7.053 7.68e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.669 on 30 degrees of freedom
Multiple R-squared: 0.6238, Adjusted R-squared: 0.6113
F-statistic: 49.75 on 1 and 30 DF, p-value: 7.677e-08
Lecture 10: Linear Regression
Simple Linear Regression Model
The calculation for slope (b) is:
\[b = \frac{\sum_i(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i(X_i - \bar{X})^2}\]
Given:
- \(\bar{X} = 0.3222\)
- \(\bar{Y} = 4.3094\)
- \(\sum_i(X_i - \bar{X})^2 = 1.2221\)
- \(\sum_i(X_i - \bar{X})(Y_i - \bar{Y}) = 13.0123\)
b = 13.0123 / 1.2221 = 10.647
Intercept (a):
\(a = \bar{Y} - b\bar{X} = 4.3094 - 10.647(0.3222) = 0.879\)
Making predictions:
To predict the age of a lion with 0.50 proportion of black on its nose:
\[\hat{Y} = 0.88 + 10.65(0.50) = 6.2 \text{ years}\]
Confidence intervals vs. Prediction intervals:
- Confidence interval: Range for the mean age of all lions with 0.50 black
- Prediction interval: Range for an individual lion with 0.50 black
Both intervals are narrowest near \(\bar{X}\) and widen as X moves away from the mean.
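The slide's hand calculation can be checked numerically from the sums given above (a quick Python arithmetic check; the lecture's own analysis is in R):

```python
# Hand calculation from the slide, using the given sums for the lion data
Sxy = 13.0123            # sum of (Xi - Xbar)(Yi - Ybar)
Sxx = 1.2221             # sum of (Xi - Xbar)^2
xbar, ybar = 0.3222, 4.3094

b = Sxy / Sxx                       # slope, ~10.647
a = ybar - b * xbar                 # intercept, ~0.879
age_at_half_black = a + b * 0.50    # predicted age at 0.50 black, ~6.2 years
```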
Lecture 10: Linear Regression - estimates of error and significance
In addition to estimating the population parameters (intercept β0, slope β1),
we want to test hypotheses about them
- This is accomplished by analysis of variance
- Partition the variance in Y: variation due to X, and variation due to other things (error)
Lecture 10: Linear Regression - estimates of variance
Total variation in Y is “partitioned” into 3 components:
- \(SS_{regression}\): variation explained by regression
- difference between predicted values (\(\hat{y}_i\)) and mean (\(\bar{y}\))
- df = 1 for simple linear regression (parameters - 1)
- \(SS_{residual}\): variation not explained by regression
- difference between observed (\(y_i\)) and predicted (\(\hat{y}_i\)) values
- df = n - 2
- \(SS_{total}\): total variation
- sum of squared deviations of each observation (\(y_i\)) from the mean (\(\bar{y}\))
- df = n - 1
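The partition can be verified numerically; a minimal Python sketch on simulated data (the toy data are an assumption, not the lion data):

```python
import numpy as np

# Simulated data with a linear trend plus noise
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 32)
y = 1 + 10 * x + rng.normal(0, 1.5, x.size)

# Ordinary least-squares fit
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
yhat = a + b * x

# The three sums of squares
ss_total = np.sum((y - y.mean()) ** 2)   # df = n - 1
ss_reg = np.sum((yhat - y.mean()) ** 2)  # df = 1
ss_res = np.sum((y - yhat) ** 2)         # df = n - 2
# SS_regression + SS_residual = SS_total (up to floating-point rounding)
```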
Lecture 10: Linear Regression - estimates of variance
Total variation in Y is “partitioned” into 3 components (illustrated in the lecture figure, panels A-D):
- \(SS_{regression}\): variation explained by regression (greater in panel C than in D)
- \(SS_{residual}\): variation not explained by regression (greater in panel B than in A)
- \(SS_{total}\): total variation
Lecture 10: Linear Regression - estimates of variance
Sums of squares and degrees of freedom are additive:
\(SS_{regression} +SS_{residual} = SS_{total}\)
\(df_{regression}+df_{residual} = df_{total}\)
- Sums of squares grow with n
- We need an estimate of variance that does not depend on sample size
Lecture 10: Linear Regression - estimates of variance
Sums of squares are converted to mean squares
- Mean square = sum of squares divided by its degrees of freedom; does not depend on n
- \(MS_{residual}\): estimates the population variance
- \(MS_{regression}\): estimates the population variance plus variation due to the X-Y relationship
- Mean squares are not additive
Lecture 10: Linear Regression - Null Hypothesis
Regression typically tests null hypothesis that β1 = 0
- or no relationship between X and Y
Can test in two ways:
Using the t-statistic:
\[t=\frac{b_1-\theta}{s_{b_{1}}}\]
- \(s_{b_{1}}\) = standard error of the slope estimate
- \(\theta\) = hypothesized slope; for H0: β1 = 0 this reduces to \(t=\frac{b_1}{s_{b_{1}}}\)
- a one-parameter t-test of whether β1 = 0
- the t-statistic test is more general
- R can provide both the t-test and the F-test
- can also ask whether β0 = 0 using a t-test
- or whether two regression lines are significantly different
Lecture 10: Linear Regression - Null Hypothesis
Regression typically tests null hypothesis that β1 = 0
- or no relationship between X and Y
Can test in two ways:
Using F-ratio:
\[F = \frac {MS_{regression}}{MS_{residual}}\]
- if β1 = 0, the ratio will be ≈ 1 on average; otherwise > 1
- compare the F-ratio to the df-specific F-distribution
- decide how likely we are to obtain our F-ratio by chance
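In simple linear regression the two tests agree exactly: the slope t-statistic squared equals the F-ratio. A Python sketch with simulated toy data (an assumption, not the lion data):

```python
import numpy as np

# Simulated data with a real slope
rng = np.random.default_rng(7)
n = 32
x = rng.uniform(0, 1, n)
y = 1 + 10 * x + rng.normal(0, 1.5, n)

Sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

ms_res = np.sum(resid ** 2) / (n - 2)   # MS_residual, df = n - 2
ms_reg = b ** 2 * Sxx                   # MS_regression = SS_regression, df = 1
F = ms_reg / ms_res                     # F-ratio
t = b / np.sqrt(ms_res / Sxx)           # t-statistic for H0: beta1 = 0
# t**2 equals F
```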
Lecture 10: Linear Regression - Explained variance
- Want to know how strong is association between X and Y
- Coefficient of determination (\(R^2\)): proportion of variation in Y explained by X
\[R^2 = \frac{SS_{regression}}{SS_{total}}=1-\frac{SS_{residual}}{SS_{total}}\]
- When more of variation is due to regression rather than ‘error’, \(R^2\) closer to 1
Lecture 10: Linear Regression - Explained variance
\[F = \frac{MS_{regression}}{MS_{residual}}\] \[r^2 = \frac{SS_{regression}}{SS_{total}}\]
summary(lion_model)
Call:
lm(formula = age_years ~ proportion_black, data = lion_data)
Residuals:
Min 1Q Median 3Q Max
-2.5449 -1.1117 -0.5285 0.9635 4.3421
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8790 0.5688 1.545 0.133
proportion_black 10.6471 1.5095 7.053 7.68e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.669 on 30 degrees of freedom
Multiple R-squared: 0.6238, Adjusted R-squared: 0.6113
F-statistic: 49.75 on 1 and 30 DF, p-value: 7.677e-08
anova(lion_model)
Analysis of Variance Table
Response: age_years
Df Sum Sq Mean Sq F value Pr(>F)
proportion_black 1 138.544 138.544 49.75 7.677e-08 ***
Residuals 30 83.543 2.785
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reporting results
“Lion age (years) could be predicted from nose pigmentation (proportion black) using the simple linear regression model age = 10.65 × proportion_black + 0.88. Regression analysis showed that the slope of the relationship was significantly (at α = 0.05) different from 0 (\(F_{1,30}\) = 49.75, p < 0.0001, R² = 0.62).”
Note there is also an adjusted R². It accounts for the number of predictors in the model, penalizing the addition of variables that do not significantly improve the fit.
The formula for adjusted R² is:
\[ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\]
Where:
- n is the number of observations (32 lions)
- p is the number of predictors (1 = proportion_black)
\(R^2\) measures the proportion of variance in the dependent variable (age_years) that is explained by the independent variable (proportion_black).
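Plugging the lion model's values into the adjusted R² formula (a quick arithmetic check in Python; the lecture itself uses R):

```python
# Adjusted R^2 for the lion model: R^2 = 0.6238, n = 32 lions, p = 1 predictor
r2, n, p = 0.6238, 32, 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
# matches the "Adjusted R-squared: 0.6113" line in the R summary output
```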
Assumptions and diagnostics of regression
- Assumptions apply to observed values of Y and εi
- most can be assessed by looking at residuals (distance from predicted)
Linearity:
- relationship between X and Y in population is straight line
Check:
- examine biplot of Y on X
If violated:
- transform Y
- use polynomial or nonlinear regression
Assumptions and diagnostics of regression
Normality:
- y-values for each xi are normally distributed.
- OLS estimates moderately robust to violation
Check:
- are residuals normally distributed?
- Q-Q plots, histogram of residuals, Shapiro-Wilk test
If violated:
- transform Y
- use Generalized Linear Model
Assumptions and diagnostics of regression
Homogeneity of variance:
- y-values for each xi have same variance.
- OLS estimates NOT robust to violation
Check:
- plot residuals against x-values or predicted values (ŷi)
If violated:
- transform Y
- use GLM
- weighted LS regression
Assumptions and diagnostics of regression
Independence:
- Y values from each xi do not influence each other
- Often violated with repeated measurements in time/space -> autocorrelation
Check: compute the correlation coefficient between adjacent residuals (e.g., a Durbin-Watson test)
If violated:
- ANOVA (grouping present)
- mixed model ANOVA
- time series methods
Assumptions and diagnostics of regression
Fixed X:
- \(x_i\) are known values fixed by the researcher (e.g., drug doses)
- often not true in ecology
If violated:
- not a problem for hypothesis testing or prediction, but
- error is underestimated
- can use Model II regression
Assumptions and diagnostics of regression
Outliers and influence:
- Cook’s distance (\(D_i\)) measures how much each point affects the slope
- large \(D_i\) (> 1) indicates an influential observation
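As a sketch of how Cook's D flags influential points, it can be computed from the hat matrix (illustrative Python with simulated data; the planted outlier and all names are assumptions, not the lecture's example):

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's D for simple linear regression, computed from the hat matrix."""
    X = np.column_stack([np.ones_like(x), x])
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
    h = np.diag(H)                         # leverages
    e = y - H @ y                          # residuals
    p = X.shape[1]                         # number of parameters (2)
    mse = np.sum(e ** 2) / (len(y) - p)
    return (e ** 2 / (p * mse)) * h / (1 - h) ** 2

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 + 3 * x + rng.normal(0, 0.2, 20)
y[-1] += 5          # plant one gross outlier at an extreme x
D = cooks_distance(x, y)
# the planted outlier has by far the largest D, well above 1
```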
Assumptions and diagnostics of regression
Residual plots
Plots of residuals vs. predicted values (\(\hat{y}_i\)) can be used to assess assumptions:
- linearity
- normality
- equal variance
- outliers
Dealing with violations
Weighted least squares:
- when variances are unequal, can use the WLS approach
- each point is weighted by the reciprocal of its variance (points with large variance are given less weight)
Robust regression:
- when the distribution is distinctly non-normal and/or there are large outliers
LAD (least absolute deviations):
- parameters estimated from absolute rather than squared residuals
- outliers are not as influential
M-estimators:
- residuals are weighted differently depending on distance from the mean
Rank-based: “if all else fails”
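The weighted least squares idea above can be sketched in closed form, with weights playing the role of reciprocal variances (illustrative Python, toy data; not the lecture's example):

```python
import numpy as np

def wls_fit(x, y, w):
    """Weighted least squares for y = a + b*x; w is a weight per point
    (e.g., the reciprocal of each point's variance)."""
    xw = np.sum(w * x) / np.sum(w)   # weighted means
    yw = np.sum(w * y) / np.sum(w)
    b = np.sum(w * (x - xw) * (y - yw)) / np.sum(w * (x - xw) ** 2)
    a = yw - b * xw
    return a, b

# On exact data any positive weights recover the same line: y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x
a, b = wls_fit(x, y, w=np.array([5.0, 1.0, 1.0, 1.0, 0.2]))
```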
Model II regression
Fixed X is an assumption of ordinary (Model I) regression
- what if X is random (the typical case)?
- if the goal is prediction (interpolation), then Model I is OK…
- if the goal is correct parameter and error estimates, may need Model II
Model II regression
Model II regression is an underused approach in ecology
- Model I regression will still perform well for hypothesis tests
- but it underestimates the true slope when X is measured with error
- Model II minimizes the distance between points and line along both axes
- (vs. along Y only in OLS)
- the MA and RMA approaches are slightly different
Model II Regression
Detailed Explanation of Model II Regression Types
- Standardized Major Axis (SMA)
- SMA regression minimizes the product of the vertical and horizontal distances from the points to the regression line. It’s implemented in the smatr package with method=“SMA”. SMA is appropriate when the measurement scales of X and Y are different.
- Major Axis (MA)
- MA regression minimizes the perpendicular distances from the data points to the regression line. It’s implemented in the smatr package with method=“MA”. MA is appropriate when X and Y are measured in the same units.
- Reduced Major Axis (RMA)
- In most of the literature, “reduced major axis” (also called geometric mean regression) refers to the same method as SMA: its slope is the geometric mean of the OLS slopes of Y on X and X on Y (equivalently, the OLS slope of Y on X divided by the absolute value of r, with the sign of the correlation). Note that the lmodel2 package uses “RMA” for ranged major axis, a different method that rescales the variables by their ranges (the range.y and range.x arguments) before fitting, which is why the lmodel2 output below reports SMA and RMA as separate lines.
Model II Regression
When to Use Each Method
OLS (Model I) - Use when:
- X is measured without error
- The research goal is predicting Y from X
- There’s a clear dependent variable
MA (Major Axis) - Use when:
- X and Y are measured in the same units
- Both variables have similar error variances
- The goal is to understand the symmetric relationship
SMA (Standardized Major Axis) - Use when:
- X and Y are measured in different units
- The goal is to understand the structural relationship
- You want to test for isometry or allometry in scaling studies
RMA (Reduced Major Axis) - Use when:
- The ratio of error variances is approximately equal to the ratio of the true variances
- Both variables contain measurement error
- Neither variable is clearly dependent or independent
Model II Regression
Key Differences in Results
When the correlation coefficient is less than 1, the OLS slope is typically the smallest in absolute value, the Model II slopes (MA, SMA, RMA) are larger, and the inverse of the OLS slope of X on Y is larger still (compare the slope column in the lmodel2 output below: OLS < SMA < RMA < MA). This is particularly evident when the correlation between X and Y is weak. As correlation approaches 1, the differences between methods diminish.
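The relationships among the slope estimates can be sketched from summary statistics alone: the OLS slope is r·sy/sx, the SMA (geometric-mean) slope is sy/sx = OLS/|r|, and the MA slope follows the first eigenvector of the covariance matrix (illustrative Python with simulated data; not the lecture's dataset):

```python
import numpy as np

# Simulated bivariate data with moderate correlation
rng = np.random.default_rng(42)
x = rng.normal(0, 2, 200)
y = 3 * x + rng.normal(0, 3, 200)

sx, sy = x.std(ddof=1), y.std(ddof=1)
r = np.corrcoef(x, y)[0, 1]

b_ols = r * sy / sx              # Model I (OLS) slope
b_sma = np.sign(r) * sy / sx     # SMA / geometric-mean slope = OLS / |r|

# MA slope: direction of the first eigenvector of the covariance matrix
cov = np.cov(x, y)
evals, evecs = np.linalg.eigh(cov)
v = evecs[:, np.argmax(evals)]
b_ma = v[1] / v[0]
```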
Model II Regression
Decision Tree
Here’s a simplified decision tree:
- Are X and Y measured with error? If no → use OLS (Model I)
- Are the errors in X and Y approximately equal? If yes → use MA
- Are X and Y measured in different units/scales? If yes → consider SMA
- Is the correlation between X and Y weak (< 0.7)? If yes → method choice is critical; consider RMA
- Are you uncertain about the error structure? If yes → RMA is a reasonable compromise
Remember that when the correlation between X and Y is very strong (r > 0.9), all methods will yield similar results, making the choice less critical. The differences between methods become more pronounced as the correlation weakens.
Finally, it’s often valuable to run multiple methods and compare the results. If they lead to different ecological or biological interpretations, this should be explicitly addressed in your discussion.
Model II Regression
Call:
lm(formula = y ~ x, data = data_ols_m2)
Residuals:
Min 1Q Median 3Q Max
-8.058 -3.498 -0.990 2.946 16.070
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1346 2.6584 1.179 0.241
x 2.8745 0.2617 10.982 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.858 on 98 degrees of freedom
Multiple R-squared: 0.5517, Adjusted R-squared: 0.5471
F-statistic: 120.6 on 1 and 98 DF, p-value: < 2.2e-16
Model II regression
Call: lmodel2(formula = y ~ x, data = data_ols_m2, range.y =
"relative", range.x = "relative", nperm = 99)
n = 100 r = 0.7427668 r-square = 0.5517025
Parametric P-values: 2-tailed = 9.070988e-19 1-tailed = 4.535494e-19
Angle between the two OLS regression lines = 8.317517 degrees
Permutation tests of OLS, MA, RMA slopes: 1-tailed, tail corresponding to sign
A permutation test of r is equivalent to a permutation test of the OLS slope
P-perm for SMA = NA because the SMA slope cannot be tested
Regression results
Method Intercept Slope Angle (degrees) P-perm (1-tailed)
1 OLS 3.134600 2.874473 70.81773 0.01
2 MA -18.688419 5.059928 78.82062 0.01
3 SMA -6.805843 3.869953 75.51163 NA
4 RMA -8.131038 4.002664 75.97273 0.01
Confidence intervals
Method 2.5%-Intercept 97.5%-Intercept 2.5%-Slope 97.5%-Slope
1 OLS -2.140952 8.410152 2.355051 3.393894
2 MA -29.748586 -10.906922 4.280654 6.167542
3 SMA -12.339088 -1.965648 3.385234 4.424077
4 RMA -16.211541 -1.554125 3.344023 4.811882
Eigenvalues: 54.08817 1.502864
H statistic used for computing C.I. of MA: 0.001181286
95% confidence intervals for the OLS fit:
2.5 % 97.5 %
(Intercept) -2.140952 8.410152
x 2.355051 3.393894
Call: sma(formula = y ~ x, data = data_ols_m2, method = "SMA")
Fit using Standardized Major Axis
------------------------------------------------------------
Coefficients:
elevation slope
estimate -6.805843 3.869953
lower limit -12.094381 3.385234
upper limit -1.517306 4.424077
H0 : variables uncorrelated
R-squared : 0.5517025
P-value : < 2.22e-16
Call: sma(formula = y ~ x, data = data_ols_m2, method = "MA")
Fit using Major Axis
------------------------------------------------------------
Coefficients:
elevation slope
estimate -18.688419 5.059928
lower limit -27.905282 4.280260
upper limit -9.471556 6.168337
H0 : variables uncorrelated
R-squared : 0.5517025
P-value : < 2.22e-16